Projet Python pour la data science

Ecole Nationale de la Statistique et de l'Administration Economique (ENSAE)


Recherche des pays potentiels en investissement dans l'éducation


    Réalisé par:
    • TIENDREBEOGO Mahamadou
    • SARE GNANISSO
    • Channdeth SOK
    Eléves Ingénieurs 2A de l'ENSAE IP Paris
  • Sous la supervision de:
  • Mme ANTUNEZ Kim
  • Enseignante à l'ENSAE

Introduction
¶

L'une des étapes importantes dans le processus d'investissement de toute entreprise est l'identification des marchés potentiels suceptibles de générer des retours sur investissement positif. Toute entreprise avant de se lancer dans un marché se doit d'étudier ce marché afin de déceller les opportunités présentes. Dans le cadre d'un investissement dans le domaine éducatif sur le plan international, il sied de savoir:.

  • Quels sont les pays avec un fort potentiel éducatif?
  • Pour chacun de ces pays, quelle sera l’évolution de ce potentiel éducatif ?
  • Dans quels pays une telle entreprise doit-elle opérer en priorité ?

Telles sont les questions fondamentales que nous allons resoudre dans notre travail. Il consiste à partir des données sur l'éducation de la banque mondiale méner une analyse afin de guider une entreprise dans son programme principalement d'expansion.

  • l’Analyse exploratoire pour juger de la pertinence des données ( Validation de la qualité de ce jeu de données)
  • Description des informations contenues dans le jeu de données
  • Sélection des informations qui semblent pertinentes pour répondre à la problématique
  • L'Analyse Statistiaque afin d'identifier les pays potentiels ( l'analyse en composante principale: ACP,...)

Importation des librairies nécessaires
¶

In [1]:
%matplotlib inline
import re
from os import path, getenv, environ
from typing import Dict

import folium
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats as stats
import seaborn as sns
from sklearn import set_config
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


set_config(display="diagram", print_changed_only=False)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Chargement de la base et analyse preeliminaire
¶

In [2]:
# Recupération de la localisation des fichiers pour facilitr le travail
ROOT_DIR = path.dirname(path.realpath("__file__"))

ROOT_DIR
Out[2]:
'C:\\Users\\sareg\\Python Project'
In [4]:
# Importation de la base de données sous avec la fonction path.join qui prend en argument la localisation
# et le fichier des données
filepath = path.join(ROOT_DIR, "education_stats.csv")
# ce fichier recuperé  sera affecté à la varaible data et le separateur est point virgule.
data = pd.read_csv(filepath, sep=";")
# Utilisation de shape pour afficher la taille( nombres de ligne, nombre de colonnes) de data
print("The data shape is: {shape}".format(shape=data.shape))
The data shape is: (2904, 30)

Ainsi la base comporte 2904 observations(lignes) et 30 colonnes ( variables)

In [7]:
# Appelation de la fonction head pour afficher les premières lignes de data
data.head()
Out[7]:
Country Name Country Code Indicator Name Indicator Code 2020 Short Name Table Name Long Name 2-alpha code Currency Unit Special Notes Region Income Group WB-2 code System of National Accounts Alternative conversion factor PPP survey year Series Code Topic Short definition Long definition Unit of measure Periodicity Base Period Other notes Aggregation method Limitations and exceptions Notes from original source General comments Source
0 Arab World ARB Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN SE.XPD.TOTL.GB.ZS Expenditures NaN Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
1 Arab World ARB GDP per capita, PPP (constant 2011 international $) NY.GDP.PCAP.PP.KD 11450.86079 Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN NY.GDP.PCAP.PP.KD Economic Policy & Debt: Purchasing power parity NaN GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. NaN Annual 2011.0 NaN Weighted average NaN NaN NaN World Bank, International Comparison Program database.
2 Arab World ARB Government expenditure on post-secondary non-tertiary education as % of GDP (%) UIS.XGDP.4.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.4.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
3 Arab World ARB Government expenditure on secondary education as % of GDP (%) UIS.XGDP.23.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.23.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
4 Arab World ARB Government expenditure on tertiary education as % of GDP (%) UIS.XGDP.56.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.56.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics

La base de données contient plusieurs variables. Nous avons entre autres la variable indicator Name qui indique le pays correspondant, une variable qui renseigne sur les indicateurs. Ainsi pour chaque pays on donne un certains nombre d'indicateurs tels que l'utilisation de l'internet, les depenses du gouverneent dans l'education le PIB par tete. La valeur de l'indicateur est localisée dans la variable 2020. Ces données concernent l'année 2020. Par ailleurs, la base contient des données manquantes.

In [8]:
# Affichage des informations sur les variables( les types: float, object, character,...) 
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2904 entries, 0 to 2903
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country Name                   2904 non-null   object 
 1   Country Code                   2904 non-null   object 
 2   Indicator Name                 2904 non-null   object 
 3   Indicator Code                 2904 non-null   object 
 4   2020                           1588 non-null   float64
 5   Short Name                     2892 non-null   object 
 6   Table Name                     2892 non-null   object 
 7   Long Name                      2892 non-null   object 
 8   2-alpha code                   2856 non-null   object 
 9   Currency Unit                  2580 non-null   object 
 10  Special Notes                  1740 non-null   object 
 11  Region                         2568 non-null   object 
 12  Income Group                   2568 non-null   object 
 13  WB-2 code                      2880 non-null   object 
 14  System of National Accounts    2580 non-null   object 
 15  Alternative conversion factor  564 non-null    object 
 16  PPP survey year                1740 non-null   object 
 17  Series Code                    2662 non-null   object 
 18  Topic                          2662 non-null   object 
 19  Short definition               242 non-null    object 
 20  Long definition                2662 non-null   object 
 21  Unit of measure                0 non-null      float64
 22  Periodicity                    726 non-null    object 
 23  Base Period                    242 non-null    float64
 24  Other notes                    242 non-null    object 
 25  Aggregation method             726 non-null    object 
 26  Limitations and exceptions     484 non-null    object 
 27  Notes from original source     0 non-null      float64
 28  General comments               484 non-null    object 
 29  Source                         2662 non-null   object 
dtypes: float64(4), object(26)
memory usage: 680.8+ KB
In [18]:
# Stat descriptive des variables pour avoir une vue globale des variables
data.describe(include="all")
Out[18]:
Country Name Country Code Indicator Name Indicator Code 2020 Short Name Table Name Long Name 2-alpha code Currency Unit Special Notes Region Income Group WB-2 code System of National Accounts Alternative conversion factor PPP survey year Series Code Topic Short definition Long definition Unit of measure Periodicity Base Period Other notes Aggregation method Limitations and exceptions Notes from original source General comments Source
count 2904 2904 2904 2904 1.588000e+03 2892 2892 2892 2856 2580 1740 2568 2568 2880 2580 564 1740 2662 2662 242 2662 0.0 726 242.0 242 726 484 0.0 484 2662
unique 242 242 12 12 NaN 241 241 241 238 152 131 7 5 240 3 32 3 11 8 1 11 NaN 1 NaN 1 2 2 NaN 2 4
top Estonia TUN GDP per capita, PPP (constant 2011 international $) SE.XPD.TOTL.GB.ZS NaN Iraq Estonia Republic of Moldova RW Euro April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. Europe & Central Asia Upper middle income RW Country uses the 1993 System of National Accounts methodology. 1990–95 2005 SE.XPD.TOTL.GB.ZS Expenditures The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. NaN Annual NaN Cumulative Attainment Weighted average Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. NaN Please cite the International Telecommunication Union for third-party use of these data. UNESCO Institute for Statistics
freq 12 12 242 242 NaN 12 12 12 12 276 72 684 660 12 1980 96 1176 242 968 242 242 NaN 726 NaN 242 484 242 NaN 242 1936
mean NaN NaN NaN NaN 2.592391e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN 2.664621e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN 6.537010e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN 5.044388e+01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN 2.746509e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN 6.118075e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN

Les noms des variabales contiennent des espaces, il faut renommer pour faciliter la manipulation.

In [9]:
# Renommer les variables contenant des espaces avec la fonction replace ( espace devient tiret de huit)
# Rename columns: replace space & - by _
data.columns = data.columns.str.lower().str.replace('[-\s]', '_', regex=True)
data.head()
Out[9]:
country_name country_code indicator_name indicator_code 2020 short_name table_name long_name 2_alpha_code currency_unit special_notes region income_group wb_2_code system_of_national_accounts alternative_conversion_factor ppp_survey_year series_code topic short_definition long_definition unit_of_measure periodicity base_period other_notes aggregation_method limitations_and_exceptions notes_from_original_source general_comments source
0 Arab World ARB Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN SE.XPD.TOTL.GB.ZS Expenditures NaN Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
1 Arab World ARB GDP per capita, PPP (constant 2011 international $) NY.GDP.PCAP.PP.KD 11450.86079 Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN NY.GDP.PCAP.PP.KD Economic Policy & Debt: Purchasing power parity NaN GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. NaN Annual 2011.0 NaN Weighted average NaN NaN NaN World Bank, International Comparison Program database.
2 Arab World ARB Government expenditure on post-secondary non-tertiary education as % of GDP (%) UIS.XGDP.4.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.4.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
3 Arab World ARB Government expenditure on secondary education as % of GDP (%) UIS.XGDP.23.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.23.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
4 Arab World ARB Government expenditure on tertiary education as % of GDP (%) UIS.XGDP.56.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.56.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
In [10]:
# Vue globale sur les régions représentées à l'aide d'un diagramme circulaire( chaque région avec son nombre de pays: en pourcentage)
(data.region
 .value_counts(dropna=False)
 .plot(kind="pie", autopct='%.1f%%', legend = False, fontsize=10,
      xlabel="", ylabel="", table=False, figsize=(5, 5))
);

Toutes les zones sont représentées. Mais il existe une région nan qui contient 11,6% de pays. Ce sont des erreurs d'observations cela veut dire que ces pays n'ont pas de régions.

In [21]:
# Affichage des pays correspondant à des régions manquantes à l'aide de la fonction loc. Cette fonction nous affiche les pays sans région
data.loc[data.region.isnull(), "country_name"].unique()
Out[21]:
array(['Arab World', 'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)', 'European Union',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle income',
       'Middle East & North Africa',
       'Middle East & North Africa (excluding high income)',
       'Middle income', 'North America', 'OECD members', 'South Asia',
       'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)',
       'Upper middle income', 'World', 'British Virgin Islands',
       'Gibraltar', 'Nauru'], dtype=object)
In [24]:
# Suppression de tous les pays sans region( avec region manquante)
data = data.loc[data.region.notnull()]

PREPROCCESSING

Dans cette partie, il sera question d'effectuer une analyse globale de notre jeu de données. Nous allons nous interesser particulierement à:

  • L'analyse des doublons qui consiste à voir si nos individus existent de manière unique dans notre base.
  • L'analyse des valeurs manquantes qui consiste à voir s'il n'existe pas dans notre jeu de données des lignes ou des colonnes vides.
  • La visualisation de nos variables afin d'observer les interactions qui pourraient exister entre les variables
  • Le resumé stastique de nos variables
  • In [25]:
    data.describe(include="all")
    
    Out[25]:
    country_name country_code indicator_name indicator_code 2020 short_name table_name long_name 2_alpha_code currency_unit special_notes region income_group wb_2_code system_of_national_accounts alternative_conversion_factor ppp_survey_year series_code topic short_definition long_definition unit_of_measure periodicity base_period other_notes aggregation_method limitations_and_exceptions notes_from_original_source general_comments source
    count 2568 2568 2568 2568 1.436000e+03 2568 2568 2568 2532 2568 1440 2568 2568 2556 2568 564 1740 2354 2354 214 2354 0.0 642 214.0 214 642 428 0.0 428 2354
    unique 214 214 12 12 NaN 214 214 214 211 151 106 7 5 213 3 32 3 11 8 1 11 NaN 1 NaN 1 2 2 NaN 2 4
    top Estonia TUN Government expenditure on secondary education as % of GDP (%) SE.XPD.TOTL.GB.ZS NaN Estonia Estonia Republic of Iraq RW Euro April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. Europe & Central Asia Upper middle income RW Country uses the 1993 System of National Accounts methodology. 1990–95 2005 SE.XPD.TOTL.GB.ZS Expenditures The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. NaN Annual NaN Cumulative Attainment Weighted average Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. NaN Please cite the International Telecommunication Union for third-party use of these data. UNESCO Institute for Statistics
    freq 12 12 214 214 NaN 12 12 12 12 276 72 684 660 12 1980 96 1176 214 856 214 214 NaN 642 NaN 214 428 214 NaN 214 1712
    mean NaN NaN NaN NaN 4.246925e+06 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    std NaN NaN NaN NaN 4.586961e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
    min NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    25% NaN NaN NaN NaN 5.699508e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    50% NaN NaN NaN NaN 4.871977e+01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    75% NaN NaN NaN NaN 2.327380e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    max NaN NaN NaN NaN 1.262645e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    In [11]:
    # Observation de la distribution des valeurs manquantes par colonne (histogramme)
    msno.bar(data, figsize=(20, 7), fontsize=10);
    
    In [30]:
    #Distribution des valeurs manquantes par la fonction matrix: elle nous indique les positions des valeurs manquantes
    msno.matrix(data, figsize=(25, 10), fontsize=10);
    
    In [32]:
    ### Fréquence de la variable income group (histogramme par groupe de revenu)
    data.income_group.value_counts().plot(kind="barh");
    
    In [12]:
    # filtrer low income group: Ce sont des pays à RNB(Revenu national Brut) par habitant inferieur ou égal à 1000 dollars selon la 
    # banque mondiale
    # Nous allons recupérer ces pays et les écarter de l'analyse car ils n'ont pas de fort potentiel éducatif
    LOW_INCOME_FILTER = data.income_group.str.contains("low", regex=True, flags=re.I, na=False)
    data.loc[LOW_INCOME_FILTER, "country_name"].unique()
    
    Out[12]:
    array(['Afghanistan', 'Armenia', 'Bangladesh', 'Benin', 'Bhutan',
           'Bolivia', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia',
           'Cameroon', 'Central African Republic', 'Chad', 'Comoros',
           'Congo, Dem. Rep.', 'Congo, Rep.', "Cote d'Ivoire", 'Djibouti',
           'Egypt, Arab Rep.', 'El Salvador', 'Eritrea', 'Ethiopia',
           'Gambia, The', 'Georgia', 'Ghana', 'Guatemala', 'Guinea',
           'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'India',
           'Indonesia', 'Kenya', 'Kiribati', 'Korea, Dem. People’s Rep.',
           'Kosovo', 'Kyrgyz Republic', 'Lao PDR', 'Lesotho', 'Liberia',
           'Madagascar', 'Malawi', 'Mali', 'Mauritania',
           'Micronesia, Fed. Sts.', 'Moldova', 'Mongolia', 'Morocco',
           'Mozambique', 'Myanmar', 'Nepal', 'Nicaragua', 'Niger', 'Nigeria',
           'Pakistan', 'Papua New Guinea', 'Paraguay', 'Philippines',
           'Rwanda', 'Samoa', 'Sao Tome and Principe', 'Senegal',
           'Sierra Leone', 'Solomon Islands', 'Somalia', 'South Sudan',
           'Sri Lanka', 'Sudan', 'Swaziland', 'Syrian Arab Republic',
           'Tajikistan', 'Tanzania', 'Timor-Leste', 'Togo', 'Uganda',
           'Ukraine', 'Uzbekistan', 'Vanuatu', 'Vietnam',
           'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'],
          dtype=object)
    In [16]:
    # Ne garder que les pays n'ayant pas de faible RNB par habitant
    data = data.loc[~LOW_INCOME_FILTER]
    print(f"data shape: {data.shape}")
    data.head()
    
    data shape: (1896, 30)
    
    Out[16]:
    country_name country_code indicator_name indicator_code 2020 short_name table_name long_name 2_alpha_code currency_unit special_notes region income_group wb_2_code system_of_national_accounts alternative_conversion_factor ppp_survey_year series_code topic short_definition long_definition unit_of_measure periodicity base_period other_notes aggregation_method limitations_and_exceptions notes_from_original_source general_comments source
    0 Arab World ARB Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN SE.XPD.TOTL.GB.ZS Expenditures NaN Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
    1 Arab World ARB GDP per capita, PPP (constant 2011 international $) NY.GDP.PCAP.PP.KD 11450.86079 Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN NY.GDP.PCAP.PP.KD Economic Policy & Debt: Purchasing power parity NaN GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2011 international dollars. NaN Annual 2011.0 NaN Weighted average NaN NaN NaN World Bank, International Comparison Program database.
    2 Arab World ARB Government expenditure on post-secondary non-tertiary education as % of GDP (%) UIS.XGDP.4.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.4.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on post-secondary non-tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
    3 Arab World ARB Government expenditure on secondary education as % of GDP (%) UIS.XGDP.23.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.23.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on secondary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
    4 Arab World ARB Government expenditure on tertiary education as % of GDP (%) UIS.XGDP.56.FSGOV NaN Arab World Arab World Arab World 1A NaN Arab World aggregate. Arab World is composed of members of the League of Arab States. NaN NaN 1A NaN NaN NaN UIS.XGDP.56.FSGOV Expenditures NaN Total general (local, regional and central) government expenditure on tertiary education (current, capital, and transfers), expressed as a percentage of GDP. It includes expenditure funded by transfers from international sources to government. Divide total government expenditure for a given level of education (ex. primary, secondary, or all levels combined) by the GDP, and multiply by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN NaN NaN NaN NaN NaN NaN NaN UNESCO Institute for Statistics
    In [14]:
    data.describe(include="all")
    
    Out[14]:
    country_name country_code indicator_name indicator_code 2020 short_name table_name long_name 2_alpha_code currency_unit special_notes region income_group wb_2_code system_of_national_accounts alternative_conversion_factor ppp_survey_year series_code topic short_definition long_definition unit_of_measure periodicity base_period other_notes aggregation_method limitations_and_exceptions notes_from_original_source general_comments source
    count 1896 1896 1896 1896 1.037000e+03 1884 1884 1884 1860 1572 1140 1560 1560 1872 1572 240 996 1738 1738 158 1738 0.0 474 158.0 158 474 316 0.0 316 1738
    unique 158 158 12 12 NaN 157 157 157 155 87 86 7 3 156 3 16 3 11 8 1 11 NaN 1 NaN 1 2 2 NaN 2 4
    top Arab World ARB Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS NaN Arab World Arab World Arab World 1A Euro April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. Europe & Central Asia Upper middle income 1A Country uses the 1993 System of National Accounts methodology. 1987–95 Rolling SE.XPD.TOTL.GB.ZS Expenditures The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN Annual NaN Cumulative Attainment Weighted average Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. NaN Please cite the International Telecommunication Union for third-party use of these data. UNESCO Institute for Statistics
    freq 12 12 158 158 NaN 12 12 12 12 264 72 588 660 12 1248 48 444 158 632 158 158 NaN 474 NaN 158 316 158 NaN 158 1264
    mean NaN NaN NaN NaN 3.705797e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    std NaN NaN NaN NaN 3.274435e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
    min NaN NaN NaN NaN 1.670000e-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    25% NaN NaN NaN NaN 9.728620e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    50% NaN NaN NaN NaN 5.565753e+01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    75% NaN NaN NaN NaN 7.931259e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    max NaN NaN NaN NaN 6.118075e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    In [17]:
    #Ce sont les indicateurs qui nous serviront dans la suite, pour notre analyse
    #Affichage de la liste des indicateurs( noms et code par ordre croissant de code)
    (data
     .loc[:, ["indicator_name", "indicator_code"]]
     .drop_duplicates()
     .sort_values(by="indicator_code")
     .reset_index(drop=True)
    )
    
    Out[17]:
    indicator_name indicator_code
    0 Internet users (per 100 people) IT.NET.USER.P2
    1 GDP per capita, PPP (constant 2011 international $) NY.GDP.PCAP.PP.KD
    2 Gross enrolment ratio, upper secondary, both sexes (%) SE.SEC.ENRR.UP
    3 Gross enrolment ratio, tertiary, both sexes (%) SE.TER.ENRR
    4 Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS
    5 Population, ages 15-64 (% of total) SP.POP.1564.TO.ZS
    6 Population, total SP.POP.TOTL
    7 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total UIS.EA.3T6.AG25T99
    8 Gross enrolment ratio, post-secondary non-tertiary, both sexes (%) UIS.GER.4
    9 Government expenditure on secondary education as % of GDP (%) UIS.XGDP.23.FSGOV
    10 Government expenditure on post-secondary non-tertiary education as % of GDP (%) UIS.XGDP.4.FSGOV
    11 Government expenditure on tertiary education as % of GDP (%) UIS.XGDP.56.FSGOV
    In [20]:
    # description du nombre d'indicateurs par pays
    df_nb_indicators_by_country = data.groupby(["region", "country_name", "country_code"],
                                               as_index=False, dropna=False)["2020"].count()
    
    df_nb_indicators_by_country.describe(include="all")
    
    Out[20]:
    region country_name country_code 2020
    count 130 158 158 158.000000
    unique 7 158 158 NaN
    top Europe & Central Asia American Samoa ASM NaN
    freq 49 1 1 NaN
    mean NaN NaN NaN 6.563291
    std NaN NaN NaN 2.774850
    min NaN NaN NaN 0.000000
    25% NaN NaN NaN 5.000000
    50% NaN NaN NaN 6.000000
    75% NaN NaN NaN 8.750000
    max NaN NaN NaN 12.000000
    In [29]:
    df_nb_indicators_by_country.plot(kind="hist", title="Distribution - nombre  d'indicateur par pays");
    
    In [28]:
    # La carte des pays selon le nombre d'indicateurs disponible
    px.choropleth(df_nb_indicators_by_country,               
                  locations="country_code",               
                  color="2020",
                  # hover_name="country_name",
                  hover_data=["region", "country_name", "country_code"],
                  animation_frame=None,    
                  color_continuous_scale='Plasma',  
                  height=500,
                  width=None,
                  title="Number of indicator by country",
                  #range_color=(0, 12),
    )
    
    In [26]:
    # On va utiliser le quantile d'ordre 1(25%) du nombre d'indicateur disponoble pour filtrer les pays
    
    q1 = df_nb_indicators_by_country.quantile(.25).values[0]
    q1
    
    Out[26]:
    5.0
    In [30]:
    # countries that have less than 5 indicators, ces pays ne contiennent pas suffisament d'indicateurs pour l'analyse
    less_than_1rs_quartile_indicators = df_nb_indicators_by_country.loc[df_nb_indicators_by_country["2020"] <= q1,
                                                                        "country_name"]
    display(less_than_1rs_quartile_indicators)
    
    data = data.loc[~data.country_name.isin(less_than_1rs_quartile_indicators)]
    
    0                 American Samoa
    5               French Polynesia
    6                           Guam
    7           Hong Kong SAR, China
    12              Marshall Islands
    13                 New Caledonia
    15      Northern Mariana Islands
    16                         Palau
    17                     Singapore
    20                        Tuvalu
    22                       Andorra
    25                       Belarus
    27        Bosnia and Herzegovina
    29               Channel Islands
    35                 Faroe Islands
    40                     Greenland
    44                   Isle of Man
    48                 Liechtenstein
    52                        Monaco
    53                    Montenegro
    60                    San Marino
    61                        Serbia
    68                  Turkmenistan
    77                Cayman Islands
    82                       Curacao
    83                      Dominica
    86                       Grenada
    91                   Puerto Rico
    92     Sint Maarten (Dutch part)
    95      St. Martin (French part)
    97                      Suriname
    99      Turks and Caicos Islands
    102        Virgin Islands (U.S.)
    111                        Libya
    117         United Arab Emirates
    118                      Bermuda
    121                     Maldives
    125                        Gabon
    131       British Virgin Islands
    138                    Gibraltar
    150                        Nauru
    Name: country_name, dtype: object
    In [31]:
    data.describe(include="all")
    
    Out[31]:
    country_name country_code indicator_name indicator_code 2020 short_name table_name long_name 2_alpha_code currency_unit special_notes region income_group wb_2_code system_of_national_accounts alternative_conversion_factor ppp_survey_year series_code topic short_definition long_definition unit_of_measure periodicity base_period other_notes aggregation_method limitations_and_exceptions notes_from_original_source general_comments source
    count 1404 1404 1404 1404 9.150000e+02 1404 1404 1404 1392 1104 972 1104 1104 1392 1104 192 900 1287 1287 117 1287 0.0 351 117.0 117 351 234 0.0 234 1287
    unique 117 117 12 12 NaN 117 117 117 116 72 74 6 3 116 3 13 3 11 8 1 11 NaN 1 NaN 1 2 2 NaN 2 4
    top Arab World ARB Expenditure on education as % of total government expenditure (%) SE.XPD.TOTL.GB.ZS NaN Arab World Arab World Arab World 1A Euro April 2012 database update: Based on official government statistics, national accounts data were revised for 2000 onward; the base year changed to 2006. Europe & Central Asia Upper middle income 1A Country uses the 1993 System of National Accounts methodology. 1987–95 Rolling SE.XPD.TOTL.GB.ZS Expenditures The percentage of population (age 25 and over) with at least completed upper secondary education (ISCED 3 or higher). This indicator is calculated by dividing the number of persons aged 25 years and above with completed upper secondary education by the total population of the same age group and multiplying the result by 100. The UNESCO Institute for Statistics (UIS) educational attainment dataset shows the educational composition of the population aged 25 years and above and hence the stock and quality of human capital within a country. The dataset also reflects the structure and performance of the education system and its accumulated impact on human capital formation. For more information, visit the UNESCO Institute for Statistics website: http://www.uis.unesco.org/ Total general (local, regional and central) government expenditure on education (current, capital, and transfers), expressed as a percentage of total general government expenditure on all sectors (including health, education, social services, etc.). It includes expenditure funded by transfers from international sources to government. Public education expenditure includes spending by local/municipal, regional and national governments (excluding household contributions) on educational institutions (both public and private), education administration, and subsidies for private entities (students/households and other privates entities). In some instances data on total public expenditure on education refers only to the ministry of education and can exclude other ministries that spend a part of their budget on educational activities. The indicator is calculated by dividing total public expenditure on education incurred by all government agencies/departments by the total government expenditure and multiplying by 100. For more information, consult the UNESCO Institute of Statistics website: http://www.uis.unesco.org/Education/ NaN Annual NaN Cumulative Attainment Weighted average Operators have traditionally been the main source of telecommunications data, so information on subscriptions has been widely available for most countries. This gives a general idea of access, but a more precise measure is the penetration rate - the share of households with access to telecommunications. During the past few years more information on information and communication technology use has become available from household and business surveys. Also important are data on actual use of telecommunications services. Ideally, statistics on telecommunications (and other information and communications technologies) should be compiled for all three measures: subscriptions, access, and use. The quality of data varies among reporting countries as a result of differences in regulations covering data provision and availability.\n\nDiscrepancies may also arise in cases where the end of a fiscal year differs from that used by ITU, which is the end of December of every year. A number of countries have fiscal years that end in March or June of every year. NaN Please cite the International Telecommunication Union for third-party use of these data. UNESCO Institute for Statistics
    freq 12 12 117 117 NaN 12 12 12 12 204 48 432 480 12 936 48 408 117 468 117 117 NaN 351 NaN 117 234 117 NaN 117 936
    mean NaN NaN NaN NaN 4.194025e+07 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    std NaN NaN NaN NaN 3.483212e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
    min NaN NaN NaN NaN 1.670000e-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    25% NaN NaN NaN NaN 8.327495e+00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    50% NaN NaN NaN NaN 5.159572e+01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    75% NaN NaN NaN NaN 1.917860e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    max NaN NaN NaN NaN 6.118075e+09 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2011.0 NaN NaN NaN NaN NaN NaN
    In [32]:
    sns.catplot(data=data, y='region', x='2020', col='indicator_code',
                height=5, aspect=1.5, palette="husl", orient="h",
                col_wrap=2, kind="box", sharex=False, sharey=True);
    
    In [33]:
    # topn (n=10) of country by indicator: On a ainsi le top des 10 pays par indicateur
    TOPN = 10
    df_country_by_indicator = (data
                               .groupby(['indicator_name'], as_index=True, dropna=False)
                               [['region', 'country_name', '2020']]
                               .apply(lambda x: x.nlargest(TOPN, columns=['2020'], keep="all"))
                               .reset_index()
                               )
    
    df_country_by_indicator
    
    Out[33]:
    indicator_name level_1 region country_name 2020
    0 Expenditure on education as % of total government expenditure (%) 2496 Latin America & Caribbean St. Vincent and the Grenadines 30.789200
    1 Expenditure on education as % of total government expenditure (%) 2604 East Asia & Pacific Thailand 28.388599
    2 Expenditure on education as % of total government expenditure (%) 2664 Middle East & North Africa Tunisia 25.047310
    3 Expenditure on education as % of total government expenditure (%) 2640 East Asia & Pacific Tonga 22.955860
    4 Expenditure on education as % of total government expenditure (%) 1932 Sub-Saharan Africa Namibia 21.925060
    ... ... ... ... ... ...
    115 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total 2291 Middle East & North Africa Saudi Arabia 30.066601
    116 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total 503 Latin America & Caribbean Barbados 24.205730
    117 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total 1787 Middle East & North Africa Malta 20.398439
    118 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total 1823 Sub-Saharan Africa Mauritius 20.398439
    119 UIS: Percentage of population age 25+ with at least completed upper secondary education (ISCED 3 or higher). Total 1835 Latin America & Caribbean Mexico 18.577311

    120 rows × 5 columns

    In [34]:
    #Plot des pays dans le top10
    g = sns.catplot(data=df_country_by_indicator,
                y="country_name",
                x="2020",
                kind="bar",
                orient="h",
                height=5,
                aspect=1.7,
                hue="region",
                col="indicator_name",
                col_wrap=2,
                sharey=False,
                sharex=False,
               )
    g.set_axis_labels("", "")
    
    Out[34]:
    <seaborn.axisgrid.FacetGrid at 0x25aa1b75250>
    In [36]:
    # Nuage de poins des pays pour chaque variable
    sns.catplot(data=data, 
                x="2020",
                col="indicator_code",
                kind="strip",
                orient="h",
                height=4,
                aspect=1.5,
                col_wrap=2,
                sharex=False,
                color="blue",
               );
    

    Remarques¶

    La disposition des variables ne permet pas une bonne analyse statistique car ce sont les indicateurs qui constituent nos variables danalyse. Il faut donc que les indicateurs soient en colonne et les pays en ligne. Ainsi la commande pivot de la balise pandas sera utilisée. Elle fonctionne exactement comme la commande reshape sur stata. Ainsi nous obtenons une base qui nous renseigne pour chaque pays ou zone les différents indicateurs. La commande pivot nous a facilité ici la tache au lieu de procéder à des manipulation longues. Implicitement, nous allons éliminer plusieurs colonnes qui ne contenaient jusque là que des métadonnées(des informations sur les calcul des indicateurs).

    In [48]:
    # Create a spreadsheet-style pivot table as a DataFrame.
    data_pivot = pd.pivot_table(data=data,
                                index=["region", "country_name", "country_code"],
                                values='2020',
                                columns='indicator_code').reset_index()
    print(f"Data shape: {data_pivot.shape}")
    data_pivot.head()
    
    Data shape: (92, 15)
    
    Out[48]:
    indicator_code region country_name country_code IT.NET.USER.P2 NY.GDP.PCAP.PP.KD SE.SEC.ENRR.UP SE.TER.ENRR SE.XPD.TOTL.GB.ZS SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV
    0 East Asia & Pacific Australia AUS 46.756116 35281.395362 249.607498 67.03891 13.360160 66.788576 1.915300e+07 NaN 61.390049 1.88884 0.07649 1.14633
    1 East Asia & Pacific Brunei Darussalam BRN 8.996285 82049.580860 67.462151 12.69113 8.920740 66.955147 3.332410e+05 NaN 1.296230 NaN NaN NaN
    2 East Asia & Pacific China CHN 1.775913 3700.743648 38.959030 7.72048 NaN 68.462574 1.262645e+09 NaN 5.307240 NaN NaN NaN
    3 East Asia & Pacific Fiji FJI 1.496855 6673.847391 59.116379 NaN 20.484921 61.538812 8.112230e+05 NaN 2.885000 NaN NaN 0.84457
    4 East Asia & Pacific Japan JPN 29.990740 33871.843545 100.454224 48.73653 9.930410 68.232188 1.268430e+08 NaN 0.844730 1.44606 NaN 0.54858
    In [49]:
    # Description de la nouvelle base
    data_pivot.describe(include="all")
    
    Out[49]:
    indicator_code region country_name country_code IT.NET.USER.P2 NY.GDP.PCAP.PP.KD SE.SEC.ENRR.UP SE.TER.ENRR SE.XPD.TOTL.GB.ZS SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV
    count 92 92 92 91.000000 90.000000 84.000000 69.000000 63.000000 91.000000 9.200000e+01 11.000000 56.000000 46.000000 25.000000 47.000000
    unique 6 92 92 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    top Europe & Central Asia Australia AUS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    freq 36 1 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
    mean NaN NaN NaN 13.456444 23297.936342 81.920940 38.041727 14.193053 64.749554 3.592016e+07 36.865614 16.628987 1.814862 0.087523 0.983590
    std NaN NaN NaN 14.729495 18997.157329 35.704867 19.611261 5.344908 4.236421 1.359502e+08 21.497669 14.870619 0.587587 0.108619 0.455717
    min NaN NaN NaN 0.105046 3508.905125 8.165120 3.234750 4.964300 50.130735 4.537400e+04 15.675070 0.335280 0.397710 0.001670 0.059390
    25% NaN NaN NaN 3.180408 10260.927062 59.539659 22.362659 10.375395 62.227768 1.645501e+06 20.398439 5.921795 1.420477 0.023290 0.741090
    50% NaN NaN NaN 6.731396 15654.111790 80.134777 37.147419 13.360160 65.567704 6.736625e+06 30.066601 12.091540 1.831825 0.047550 0.869290
    75% NaN NaN NaN 17.225234 34689.708142 98.578098 54.428761 16.599045 67.859249 2.379614e+07 50.060844 26.575469 2.111583 0.127030 1.214570
    max NaN NaN NaN 52.000000 108323.903919 249.607498 82.439072 30.789200 72.588714 1.262645e+09 71.459663 61.390049 2.970210 0.536130 2.425410
    In [50]:
    #Visualisons à nouveau les valeurs manquantes de la base
    msno.bar(data_pivot);
    
    In [51]:
    msno.matrix(data_pivot);
    

    Recupération des indicateurs / Création de nouvelles variables¶

    In [61]:
    #liste des variables(indicteurs) de data_pivot
    indicators = ['IT.NET.USER.P2',  # internet
                  "NY.GDP.PCAP.PP.KD", # GDP
                  "SP.POP.1564.TO.ZS", # population
                  # enrolment
                  "SE.SEC.ENRR.UP",
                  "SE.TER.ENRR",
                  "UIS.GER.4",
                  # expenditure
                  "SE.XPD.TOTL.GB.ZS",
                  "UIS.XGDP.23.FSGOV",
                  "UIS.XGDP.4.FSGOV",
                  "UIS.XGDP.56.FSGOV",
                 ]
    len(indicators)
    
    Out[61]:
    10
    In [63]:
    # Création de nouvelles variables
    # total_enrolment= moyenne des trois taux de scolarisation( secondaire,post-secondaire non-supérieur, supérieur)
    # gov_expenditure= somme des dépenses du gouvernement( secondaire,post-secondaire non-supérieur, supérieur)
    # Renommer les variables(indicateurs) SE.XPD.TOTL.GB.ZS ; IT.NET.USER.P2 ; NY.GDP.PCAP.PP.KD .
    data_pivot = (data_pivot
                  .assign(total_enrolment=lambda dframe: dframe[["SE.SEC.ENRR.UP", "SE.TER.ENRR", "UIS.GER.4"]
                                                               ].mean(axis=1),
                          gov_expenditure=lambda dframe: (dframe["UIS.XGDP.23.FSGOV"]
                                                          + dframe["UIS.XGDP.4.FSGOV"]
                                                          + dframe["UIS.XGDP.56.FSGOV"]
                                                         ),
                         )
                  .rename(columns={"SE.XPD.TOTL.GB.ZS": "percent_education_total_gov_expenditure",
                                   "IT.NET.USER.P2": "internet_users",
                                   "NY.GDP.PCAP.PP.KD": "gdp_per_capita"})
                 )
    data_pivot.head()
    
    Out[63]:
    indicator_code region country_name country_code internet_users gdp_per_capita SE.SEC.ENRR.UP SE.TER.ENRR percent_education_total_gov_expenditure SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV total_enrolment gov_expenditure
    0 East Asia & Pacific Australia AUS 46.756116 35281.395362 249.607498 67.03891 13.360160 66.788576 1.915300e+07 NaN 61.390049 1.88884 0.07649 1.14633 126.012152 3.11166
    1 East Asia & Pacific Brunei Darussalam BRN 8.996285 82049.580860 67.462151 12.69113 8.920740 66.955147 3.332410e+05 NaN 1.296230 NaN NaN NaN 27.149837 NaN
    2 East Asia & Pacific China CHN 1.775913 3700.743648 38.959030 7.72048 NaN 68.462574 1.262645e+09 NaN 5.307240 NaN NaN NaN 17.328917 NaN
    3 East Asia & Pacific Fiji FJI 1.496855 6673.847391 59.116379 NaN 20.484921 61.538812 8.112230e+05 NaN 2.885000 NaN NaN 0.84457 31.000689 NaN
    4 East Asia & Pacific Japan JPN 29.990740 33871.843545 100.454224 48.73653 9.930410 68.232188 1.268430e+08 NaN 0.844730 1.44606 NaN 0.54858 50.011828 NaN
    In [64]:
    # Visualisation ( carte des variables )
    url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
    country_geo = f'{url}/world-countries.json'
    
    In [65]:
    for col_name in ['internet_users',
                     "gdp_per_capita",
                     "total_enrolment",
                     "gov_expenditure",
                     "percent_education_total_gov_expenditure",
                    ]:
        m = folium.Map(location=[0, 0], zoom_start=1)
        folium.Choropleth(geo_data=country_geo,
                      data=data_pivot,
                      columns=['country_code', col_name],
                      key_on='feature.id',
                      name="choropleth",
                      fill_color='BuPu',
                      fill_opacity=0.7,
                      line_opacity=0.2,
                      legend_name=col_name,
                      nan_fill_color="black",
                      highlight=True,
                     ).add_to(m)
        display(m)
    
    Make this Notebook Trusted to load map: File -> Trust Notebook
    Make this Notebook Trusted to load map: File -> Trust Notebook
    Make this Notebook Trusted to load map: File -> Trust Notebook
    Make this Notebook Trusted to load map: File -> Trust Notebook
    Make this Notebook Trusted to load map: File -> Trust Notebook
    In [57]:
    # Affichage des premières lignes de data_pivot
    data_pivot.head()
    
    Out[57]:
    indicator_code region country_name country_code internet_users gdp_per_capita SE.SEC.ENRR.UP SE.TER.ENRR percent_education_total_gov_expenditure SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV total_enrolment gov_expenditure
    0 East Asia & Pacific Australia AUS 46.756116 35281.395362 249.607498 67.03891 13.360160 66.788576 1.915300e+07 NaN 61.390049 1.88884 0.07649 1.14633 126.012152 3.11166
    1 East Asia & Pacific Brunei Darussalam BRN 8.996285 82049.580860 67.462151 12.69113 8.920740 66.955147 3.332410e+05 NaN 1.296230 NaN NaN NaN 27.149837 NaN
    2 East Asia & Pacific China CHN 1.775913 3700.743648 38.959030 7.72048 NaN 68.462574 1.262645e+09 NaN 5.307240 NaN NaN NaN 17.328917 NaN
    3 East Asia & Pacific Fiji FJI 1.496855 6673.847391 59.116379 NaN 20.484921 61.538812 8.112230e+05 NaN 2.885000 NaN NaN 0.84457 31.000689 NaN
    4 East Asia & Pacific Japan JPN 29.990740 33871.843545 100.454224 48.73653 9.930410 68.232188 1.268430e+08 NaN 0.844730 1.44606 NaN 0.54858 50.011828 NaN

    Scoring/ la partie décisionnelle pour l'investissement¶

    Dans cette partie, nous allons construire un score d'attractivité. Celui-ci nous permettra d'ordonner les pays en fonction de leur attractivité en terme d'investissement dans l'éducation.

    Pour ce faire nous allons utiliser deux methodes, la première est une pondération et la deuxième est une analyse en composante principale.

    In [99]:
    # Ici on s'intéresse aux variables clés : le nombre d'utilisateurs d'internet, le PIB par habitant, 
    # le taux moyen de scolarisation et les dépenses totales du gouvernement dans l'éducation(en %).
    # La liste des indicateurs pour le score d'attractivité
    
    list_features = ["internet_users",
                     "gdp_per_capita",
                     "total_enrolment",
                     "percent_education_total_gov_expenditure",
                    ]
    
    In [100]:
    # Définition de la fonction Scoring
    def scoring(dframe: pd.DataFrame,
               features_weight: Dict[str, float]
               ) -> float:
        """ Country attractivity score
        
        Args:
            dframe (pd.DataFrame): data frame
        
        Returns:
            float: attractivity score
        
        """
        score = 0
        for col_name, weight in features_weight.items():
            score += weight * dframe[col_name]
            
        return score
    

    Méthode de Pondération¶

    In [101]:
    # internet users stats
    internet_stat = data_pivot["internet_users"].describe(percentiles=np.arange(0, 1, 0.1))
    internet_stat
    
    Out[101]:
    count    92.000000
    mean     13.355744
    std      14.680149
    min       0.105046
    0%        0.105046
    10%       1.511643
    20%       2.774729
    30%       3.693737
    40%       5.542655
    50%       6.688139
    60%       8.597771
    70%      14.869559
    80%      23.066165
    90%      42.688489
    max      52.000000
    Name: internet_users, dtype: float64
    In [102]:
    # filter out all coutries that have internet_users less thant 80% decile
    data_custom = data_pivot.loc[data_pivot["internet_users"] >= internet_stat.loc["70%"], :].copy()
    data_custom.country_name.unique()
    
    Out[102]:
    array(['Australia', 'Japan', 'Korea, Rep.', 'Malaysia', 'New Zealand',
           'Austria', 'Belgium', 'Cyprus', 'Denmark', 'Estonia', 'Finland',
           'Germany', 'Iceland', 'Ireland', 'Italy', 'Luxembourg',
           'Netherlands', 'Norway', 'Portugal', 'Slovenia', 'Sweden',
           'Switzerland', 'United Kingdom', 'Aruba', 'Chile', 'Israel',
           'Canada', 'United States'], dtype=object)
    In [103]:
    data_custom.head()
    
    Out[103]:
    indicator_code region country_name country_code internet_users gdp_per_capita SE.SEC.ENRR.UP SE.TER.ENRR percent_education_total_gov_expenditure SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV total_enrolment gov_expenditure
    0 East Asia & Pacific Australia AUS 46.756116 35281.395362 249.607498 67.038910 13.360160 66.788576 19153000.0 NaN 61.390049 1.88884 0.07649 1.14633 126.012152 3.11166
    4 East Asia & Pacific Japan JPN 29.990740 33871.843545 100.454224 48.736530 9.930410 68.232188 126843000.0 NaN 0.844730 1.44606 NaN 0.54858 50.011828 NaN
    5 East Asia & Pacific Korea, Rep. KOR 44.700000 20756.779897 95.884689 78.436157 20.484921 72.199592 47008111.0 63.714600 NaN NaN NaN NaN 87.160423 NaN
    7 East Asia & Pacific Malaysia MYS 21.384731 16309.967988 45.810822 25.742260 21.390680 62.719942 23185608.0 36.407089 19.770599 2.06063 0.17907 1.91465 30.441227 4.15435
    8 East Asia & Pacific New Zealand NZL 47.379557 28264.565116 123.198380 66.192177 20.484921 65.468325 3857700.0 NaN 33.464970 2.54104 0.08818 NaN 74.285175 NaN
    In [104]:
    # higher value, higher percentile rank
    
    data_custom.loc[:, list_features] = data_custom.loc[:, list_features
                                                       ].rank(pct=True, ascending=True, na_option="keep") * 100
    
    data_custom.head()
    
    Out[104]:
    indicator_code region country_name country_code internet_users gdp_per_capita SE.SEC.ENRR.UP SE.TER.ENRR percent_education_total_gov_expenditure SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV total_enrolment gov_expenditure
    0 East Asia & Pacific Australia AUS 85.714286 53.571429 249.607498 67.038910 57.142857 66.788576 19153000.0 NaN 61.390049 1.88884 0.07649 1.14633 100.000000 3.11166
    4 East Asia & Pacific Japan JPN 50.000000 42.857143 100.454224 48.736530 7.142857 68.232188 126843000.0 NaN 0.844730 1.44606 NaN 0.54858 21.428571 NaN
    5 East Asia & Pacific Korea, Rep. KOR 78.571429 17.857143 95.884689 78.436157 91.071429 72.199592 47008111.0 63.714600 NaN NaN NaN NaN 89.285714 NaN
    7 East Asia & Pacific Malaysia MYS 28.571429 14.285714 45.810822 25.742260 96.428571 62.719942 23185608.0 36.407089 19.770599 2.06063 0.17907 1.91465 3.571429 4.15435
    8 East Asia & Pacific New Zealand NZL 92.857143 32.142857 123.198380 66.192177 91.071429 65.468325 3857700.0 NaN 33.464970 2.54104 0.08818 NaN 78.571429 NaN
    In [105]:
    # Il s'agit d'affecter des coefficients( poids) à chaque variable en fonction de son importance que nous considerons.
    # custom weighted
    
    weigthed = {"internet_users": .4,
                "total_enrolment": .25,
                "gdp_per_capita": .25,
                "percent_education_total_gov_expenditure": .1,
               }
    
    
    data_custom = (data_custom
                   .assign(attractivity_score_custom=
                           lambda dframe: dframe.apply(scoring, features_weight=weigthed, axis=1))
                   .reset_index()
                  )
    data_custom[list_features + ["attractivity_score_custom"]].head()
    
    Out[105]:
    indicator_code internet_users gdp_per_capita total_enrolment percent_education_total_gov_expenditure attractivity_score_custom
    0 85.714286 53.571429 100.000000 57.142857 78.392857
    1 50.000000 42.857143 21.428571 7.142857 36.785714
    2 78.571429 17.857143 89.285714 91.071429 67.321429
    3 28.571429 14.285714 3.571429 96.428571 25.535714
    4 92.857143 32.142857 78.571429 91.071429 73.928571
    In [106]:
    top_country_custom = data_custom.nlargest(TOPN, columns=["attractivity_score_custom"], keep="all")
    g = sns.catplot(data=top_country_custom,
                    kind="bar", orient="h", hue=None,
                    y="country_name", x="attractivity_score_custom", estimator=np.mean,
                    height=5, aspect=1.4, palette="Greens_r",
                   )
    g.ax.set_title("Selected countries");
    

    Méthode Statistique ACP¶

    Dans cette partie nous allons mener une méthode d'analyse en composante principale

    In [107]:
    fig = px.scatter_3d(data_frame=data_pivot,
                        x="internet_users",
                        y="gdp_per_capita",
                        z='total_enrolment',
                        color='region',
                        hover_name="country_name",
                        symbol='region',
                        opacity=0.7,
                        width=800,
                        height=500,
                       )
    # tight layout
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
    
    In [110]:
    # Définition de la fonction fillna( elle permettra d'imputer les valeurs manquantes par la médiane)
    #
    
    def fillna_by_group(data, groupby, columns, estimator="median"):
        """ Fill na by group
        
        Args:
            data (pd.DataFrame):
            groupby (Union[str, Sequence[str]]):
            columns (Union[str, Sequence[str]]):
            estimator (str):
        
        Return:
            pd.DataFrame
        
        """
        return data_pivot[columns].fillna(data_pivot.groupby(groupby)[columns].transform(estimator))
    
    In [109]:
    # Imputer les valeurs manquantes
    data_pivot[list_features] = fillna_by_group(data=data_pivot,
                                                groupby="region",
                                                columns=list_features,
                                                estimator="median",
                                               )
    data_pivot.head()
    
    Out[109]:
    indicator_code region country_name country_code internet_users gdp_per_capita SE.SEC.ENRR.UP SE.TER.ENRR percent_education_total_gov_expenditure SP.POP.1564.TO.ZS SP.POP.TOTL UIS.EA.3T6.AG25T99 UIS.GER.4 UIS.XGDP.23.FSGOV UIS.XGDP.4.FSGOV UIS.XGDP.56.FSGOV total_enrolment gov_expenditure
    0 East Asia & Pacific Australia AUS 46.756116 35281.395362 249.607498 67.03891 13.360160 66.788576 1.915300e+07 NaN 61.390049 1.88884 0.07649 1.14633 126.012152 3.11166
    1 East Asia & Pacific Brunei Darussalam BRN 8.996285 82049.580860 67.462151 12.69113 8.920740 66.955147 3.332410e+05 NaN 1.296230 NaN NaN NaN 27.149837 NaN
    2 East Asia & Pacific China CHN 1.775913 3700.743648 38.959030 7.72048 20.484921 68.462574 1.262645e+09 NaN 5.307240 NaN NaN NaN 17.328917 NaN
    3 East Asia & Pacific Fiji FJI 1.496855 6673.847391 59.116379 NaN 20.484921 61.538812 8.112230e+05 NaN 2.885000 NaN NaN 0.84457 31.000689 NaN
    4 East Asia & Pacific Japan JPN 29.990740 33871.843545 100.454224 48.73653 9.930410 68.232188 1.268430e+08 NaN 0.844730 1.44606 NaN 0.54858 50.011828 NaN
    In [121]:
    # Standardisation de la base
    scaled_data = pd.DataFrame(MinMaxScaler().fit_transform(data_pivot[list_features]),
                               columns=list_features,
                               index=pd.MultiIndex.from_frame(data_pivot[["region", "country_name", "country_code"]]))
    scaled_data
    
    Out[121]:
    internet_users gdp_per_capita total_enrolment percent_education_total_gov_expenditure
    region country_name country_code
    East Asia & Pacific Australia AUS 0.898952 0.303129 1.000000 0.325107
    Brunei Darussalam BRN 0.171331 0.749327 0.179070 0.153203
    China CHN 0.032197 0.001830 0.097519 0.600994
    Fiji FJI 0.026820 0.030196 0.211046 0.600994
    Japan JPN 0.575888 0.289681 0.368910 0.192299
    ... ... ... ... ... ... ...
    Sub-Saharan Africa Equatorial Guinea GNQ 0.000526 0.069492 0.000954 0.362172
    Mauritius MUS 0.138289 0.072206 0.281844 0.362172
    Namibia NAM 0.029669 0.024779 0.088199 0.656760
    Seychelles SYC 0.140487 0.142577 0.286187 0.362172
    South Africa ZAF 0.101041 0.059216 0.321362 0.362172

    92 rows × 4 columns

    Avant de réaliser l'ACP nous allons effectuer le test de sphéricité de Bartlett

    In [122]:
    # le test de Bartlett.
    import scipy
    from scipy.stats import bartlett 
    p = scipy.stats.bartlett( scaled_data.internet_users,scaled_data.gdp_per_capita,scaled_data.total_enrolment,scaled_data.percent_education_total_gov_expenditure)
    p
    
    Out[122]:
    BartlettResult(statistic=32.704415217275425, pvalue=3.7178129709993507e-07)
    In [124]:
    # Application de l'ACP sur nos variables
    pca = PCA(n_components=None)
    pca.fit(scaled_data)   
    explain_var_ratio = pca.explained_variance_ratio_
    explain_var_ratio
    
    Out[124]:
    array([0.59491999, 0.20313471, 0.11460959, 0.08733571])
    In [125]:
    # weighted: Variance expliquee de chaque composante multipliée par la coordonnee de chaque variable
    weighted = sum(np.multiply(explain_var_ratio.reshape(1, -1).T, pca.components_))
    # normalized weight: sum of weigth will be equal to 1
    normalize_weighted =  (weighted / sum(weighted)).round(2)
    features_weight_stat = dict(zip(list_features, normalize_weighted))
    
    features_weight_stat
    
    Out[125]:
    {'internet_users': 0.43,
     'gdp_per_capita': 0.23,
     'total_enrolment': 0.2,
     'percent_education_total_gov_expenditure': 0.15}
    In [126]:
    data_pivot[list_features]
    
    Out[126]:
    indicator_code internet_users gdp_per_capita total_enrolment percent_education_total_gov_expenditure
    0 46.756116 35281.395362 126.012152 13.360160
    1 8.996285 82049.580860 27.149837 8.920740
    2 1.775913 3700.743648 17.328917 20.484921
    3 1.496855 6673.847391 31.000689 20.484921
    4 29.990740 33871.843545 50.011828 9.930410
    ... ... ... ... ...
    87 0.132355 10792.733023 5.699935 14.317360
    88 7.281535 11077.147618 39.526619 14.317360
    89 1.644740 6106.108164 16.206575 21.925060
    90 7.395629 18453.109771 40.049661 14.317360
    91 5.348560 9715.620644 44.285725 14.317360

    92 rows × 4 columns

    In [127]:
    scaled_data = (scaled_data
                   .assign(attractivity_score=
                           lambda dframe: dframe.apply(scoring, features_weight=features_weight_stat, axis=1))
                   .reset_index()
                  )
    scaled_data
    
    Out[127]:
    region country_name country_code internet_users gdp_per_capita total_enrolment percent_education_total_gov_expenditure attractivity_score
    0 East Asia & Pacific Australia AUS 0.898952 0.303129 1.000000 0.325107 0.705035
    1 East Asia & Pacific Brunei Darussalam BRN 0.171331 0.749327 0.179070 0.153203 0.304812
    2 East Asia & Pacific China CHN 0.032197 0.001830 0.097519 0.600994 0.123919
    3 East Asia & Pacific Fiji FJI 0.026820 0.030196 0.211046 0.600994 0.150836
    4 East Asia & Pacific Japan JPN 0.575888 0.289681 0.368910 0.192299 0.416886
    ... ... ... ... ... ... ... ... ...
    87 Sub-Saharan Africa Equatorial Guinea GNQ 0.000526 0.069492 0.000954 0.362172 0.070726
    88 Sub-Saharan Africa Mauritius MUS 0.138289 0.072206 0.281844 0.362172 0.186766
    89 Sub-Saharan Africa Namibia NAM 0.029669 0.024779 0.088199 0.656760 0.134611
    90 Sub-Saharan Africa Seychelles SYC 0.140487 0.142577 0.286187 0.362172 0.204765
    91 Sub-Saharan Africa South Africa ZAF 0.101041 0.059216 0.321362 0.362172 0.175665

    92 rows × 8 columns

    In [128]:
    top_country = scaled_data.nlargest(TOPN, columns=["attractivity_score"], keep="all")
    g = sns.catplot(data=top_country,
                    kind="bar", orient="h", hue=None,
                    y="country_name", x="attractivity_score", estimator=np.mean,
                    height=5, aspect=1.4, palette="Greens_r",
                   )
    g.ax.set_title("Selected countries") ;
    
    In [129]:
    (pd.melt(top_country,
            id_vars=["country_name", "region"],
            value_vars=list_features + ["attractivity_score"],
            var_name=None,
            value_name='value',)
     .pipe((sns.catplot, "data"), kind="swarm",
           y="country_name", x="value", hue="variable", height=5, aspect=1.5
          )
    );
    
    In [130]:
    g = sns.PairGrid(top_country,
                     x_vars=list_features, y_vars=["country_name"],
                     height=6, aspect=.5,
                    )
    
    # Stripplot
    g.map(sns.stripplot, size=10, orient="h", jitter=False,
          palette="flare_r", linewidth=1, edgecolor="w")
    
    
    titles = list_features
    
    # Labels
    g.set(xlabel="", ylabel="")
    
    for ax, title in zip(g.axes.flat, titles):
    
        # Set a different title for each axes
        ax.set(title=title)
        ax.set_xticklabels([])
    
        # Make the grid horizontal instead of vertical
        ax.xaxis.grid(False)
        ax.yaxis.grid(True)
    sns.despine(left=True, bottom=True)
    
    In [131]:
    # top of region
    scaled_data.groupby("region").attractivity_score.mean().nlargest(n=10)
    
    Out[131]:
    region
    North America                 0.620367
    East Asia & Pacific           0.366177
    Europe & Central Asia         0.332619
    Middle East & North Africa    0.240436
    Latin America & Caribbean     0.197159
    Sub-Saharan Africa            0.130659
    Name: attractivity_score, dtype: float64
    In [134]:
    fig= px.choropleth(scaled_data.reset_index(),               
                  locations="country_code",               
                  color="attractivity_score",
                  # hover_name="country_name",
                  hover_data=["region", "country_name"],
                  animation_frame=None,    
                  color_continuous_scale='Plasma',  
                  height=500,
                  width=None,
                  title="Country - Attractivity score",
                  #range_color=(0, 12),
    )
    fig.show()
    #fig.write_image("images/map_attractivity.png")
    
    In [133]:
    #!pip install -U kaleido
    #top5 of country byy Region
    g = (scaled_data.groupby("region", as_index=False)
         .apply(lambda g: g.nlargest(5, columns=["attractivity_score"], keep="all"))
         .pipe((sns.catplot, "data"), 
               kind="bar", orient="h", col="region", col_wrap=2,
               y="country_name", x="attractivity_score", estimator=np.mean,
               height=3, aspect=1.5, palette="husl", sharex=False, sharey=False
              )
        )
    # save fig
    #plt.savefig("images/top_country_per_region.png")
    
    C:\Users\sareg\anaconda3\lib\site-packages\seaborn\categorical.py:3808: UserWarning:
    
    Setting `sharey=False` with `color=None` may cause different levels of the `y` variable to share colors. This will change in a future version.